ABSTRACT

Motivated by emerging big streaming data processing paradigms
(e.g., Twitter Storm, Streaming MapReduce), we investigate
the problem of scheduling graphs over a large cluster
of servers. Each graph is a job, where nodes represent compute
tasks and edges indicate data flows between these compute
tasks. Jobs (graphs) arrive randomly over time and,
upon completion, leave the system. When a job arrives,
the scheduler needs to partition the graph and distribute
it over the servers to satisfy load balancing and cost
considerations. Specifically, neighboring compute tasks in the
graph that are mapped to different servers incur load on
the network; thus a mapping of the jobs among the servers
incurs a cost that is proportional to the number of “broken
edges”. We propose a low complexity randomized scheduling
algorithm that, without service preemptions, stabilizes the
system with graph arrivals/departures; more importantly, it
allows a smooth trade-off between minimizing average
partitioning cost and average queue lengths. Interestingly, to
avoid service preemptions, our approach does not rely on
a Gibbs sampler; instead, we show that the corresponding
limiting invariant measure has an interpretation stemming
from a loss system.
1. INTRODUCTION

In recent years, a new computing model, stream processing,
has been gaining traction for large-scale cloud computing
systems. These systems [15, 25, 29, 32] are driven by real-time
and streaming data applications. For instance, consider the
computation needed to answer the question: How many times
does the hashtag “#sigmetrics2015” appear in Twitter over
the next two hours? The key feature here is that the data is
not (yet) in a database; instead it is appearing as and when
people tweet this hashtag. Applications of such stream
computing are in many domains including social network
analytics and e-commerce.

To address such stream processing, the emerging computation
model of choice is that of graph processing. A computation
is represented by a graph, where nodes in the graph
represent either data sources or data processing (and operate
sequentially on a stream of atomic data units), and edges
in the graph correspond to data flows between nodes. To
execute such computations, each node of a graph is mapped
to a machine (server/blade) in a cloud cluster (data center),
and the communication fabric of the cloud cluster supports
the data flows corresponding to the graph edges. A canonical
example (and one of the early leaders in this setting) is
Twitter’s Storm [29], where the (directed) graph is called a
“topology”, an atomic data unit is a “tuple”, nodes are called
“spouts” or “bolts”, and tuples flow along the edges of the
topology. We refer to [?] for additional discussion.

From the cloud cluster side, there is a collection of machines
interconnected by a communication network. Each
machine can simultaneously support a finite number of graph
nodes. This number is limited by the amount of resources
(memory/processing/bandwidth) that is available at the
machine; in Storm, these available resources are called “slots”
(typically on the order of ten to fifteen per machine). Graphs
(corresponding to new computations) arrive randomly over time
to this cloud cluster and, upon completion, leave the cluster.
At any time, the scheduling task at the cloud cluster is to
map the nodes of an incoming graph onto the free slots in
machines to have an efficient cluster operation. As an
example, the default scheduler for Storm is round-robin over
the free slots; however, this is shown to be inefficient, and
heuristic alternatives have been proposed [?].

In this paper we consider a queueing framework that models
such systems with graph arrivals and departures. Jobs
are graphs that are dynamically submitted to the cluster, and
the scheduler needs to partition and distribute the jobs
over the machines. Once deployed in the cluster, the job (a
computation graph) will retain the resources for some time
duration depending on the computation needs, and will release
the resources after the computation is done (i.e., the
job departs). The need for efficient scheduling and dynamic
graph partitioning algorithms naturally arises in many parallel
computing applications [?]; however, the theoretical
studies in this area are very limited. To the best of our
knowledge, this is the first paper that develops models of
dynamic stochastic graph partitioning and packing, and the
associated low complexity algorithms with provable guarantees
for graph-based data processing applications.

From an algorithmic perspective, our low complexity algorithm
has connections to the Gibbs sampler and other
MCMC (Markov Chain Monte Carlo) methods for sampling
probability distributions (see, for example, [?]). In the setting
of scheduling in wireless networks, the Gibbs sampler has
been used to design CSMA-like algorithms for stabilizing the
network [17, ?, 23, ?]. However, unlike wireless networks
where the solutions form independent sets of a graph, there
is no natural graph structure analog in the graph partitioning.
The Gibbs sampler can still be used in our setting by
sampling partitions of graphs, where each site of the Gibbs
sampler is a unique way of partitioning and packing a graph
among the machines in the cloud cluster. The difficulty,
however, is that there is an exponentially large number of
graph partitions, leading to a correspondingly large number
of queues. The second issue is that a Gibbs sampler can
potentially interrupt ongoing service of jobs. The analog of a
service interruption in our setting is the migration of a job
(graph) from one set of machines to another in the cloud
cluster. This is an expensive operation that requires saving
the state, moving it, and reloading it on another set of machines.

A novelty of our algorithm is that we only need to maintain
one queue for each type of graph. This substantial reduction
is achieved by developing an efficient method to explore
the space of solutions in the scheduling space. Further,
our low complexity algorithm performs updates at appropriate
time instances without causing service interruptions.
In summary, our approach allows a smooth trade-off
between minimizing average partitioning cost and average
queue sizes, by using only a small number of queues, with
low complexity, and without service interruptions. As will
become clear later, the key ingredient of our method is to
minimize a modified energy function instead of the Gibbs
energy; specifically, the entropy term in the Gibbs energy is
replaced with the relative entropy with respect to a probability
distribution that arises in loss systems.

1.1 Related Work 

Dynamic graph scheduling occurs in many computing settings
such as Yahoo!’s S4 [?], Twitter’s Storm [29], IBM’s
InfoSphere Stream [15], TimeStream [22], D-Stream [32],
and online MapReduce [?]. Current scheduling solutions in
this dynamic setting are primarily heuristic [2, 16, ?].

The static version of this problem (packing a collection
of graphs on the machines on a one-time basis) is tightly
related to the graph partitioning problem [?, ?], which is
known to be hard. There are several algorithms (either
based on heuristics or approximation bounds) available in
the literature [?].

More broadly, dynamic bin packing (either scalar, or more
recently vector) has a rich history [?, ?], with much recent
attention [11, 27, ?]. Unlike bin packing where single items
are placed into bins, our objective here is to pack graphs in
a dynamic manner.

1.2 Main Contributions 

We study the problem of partitioning and packing graphs
over a cloud cluster when graphs arrive and depart dynamically
over time. The main contributions of this work can be
summarized as follows.

• A Stochastic Model of Graph Partitioning. We develop
a stochastic model of resource allocation for graph-based
applications where either the computation is represented
by a graph (Storm [29], InfoSphere Stream [15]) or
the data itself has a graph structure (GraphLab [13],
Giraph [12]). Most efforts have been on the systems aspects,
while employing a heuristic scheduler for graph partitioning
and packing. One of the contributions of this paper
is the model itself, which allows an analytical approach
towards the design of efficient schedulers.

• Deficiencies of Max Weight-type Algorithms. The
dynamic graph partitioning problem can be cast as a network
resource allocation problem; to illustrate, we describe
a frame-based Max Weight algorithm that can jointly stabilize
the system and minimize packing costs. However,
such Max Weight-type solutions have two deficiencies:

(1) they involve periodically solving the static graph
partitioning problem (NP-hard in general); thus there is little
hope that this can be implemented in practice,

(2) they require a periodic reset of the system configuration
to the Max Weight configuration; this interrupts a significant
number of ongoing computations or services of the
jobs in the system and requires them to be migrated to
new machines (which is expensive).

• Low Complexity Algorithms without Service Interruptions.
We develop a new class of low complexity
algorithms, specifically targeted for the stochastic graph
partitioning problems, and analytically characterize their
delay and partitioning costs. In particular, the algorithms
can converge to the optimal solution of the static graph
partitioning problem, by trading off delay and partitioning
cost (a tunable parameter). Equally important, this
class of algorithms does not interrupt the ongoing services in
the system. The algorithms rely on creating and removing
templates, where each template represents a unique way of
partitioning and distributing a graph over the machines.
A key ingredient of the low complexity algorithms is that
the decision to remove or add templates to the system
is only made at the instances that a graph is submitted
to the cluster or finishes its computation, thus preventing
interruption of ongoing services.

1.3 Notations 

Some of the basic notation used in this paper is the
following. $|S|$ denotes the cardinality of a set $S$. $A \setminus B$ is
the set difference defined as $\{x \in A, x \notin B\}$. $\mathbb{1}\{x \in A\}$ is
the indicator function which is 1 if $x \in A$, and 0 otherwise.
$\mathbf{1}_n$ is the $n$-dimensional vector of all ones. $\mathbb{R}_+$ denotes the
set of real nonnegative numbers. For any two probability
vectors $\pi, \nu \in \mathbb{R}^n$, the total variation distance between $\pi$
and $\nu$ is defined as $\|\pi - \nu\|_{TV} = \frac{1}{2} \sum_{i=1}^{n} |\pi_i - \nu_i|$. Further,
the Kullback-Leibler (KL) divergence of $\pi$ from $\nu$ is defined
as $D_{KL}(\pi \| \nu) = \sum_i \pi_i \log \frac{\pi_i}{\nu_i}$. Given a stochastic process
$z(t)$ which converges in distribution as $t \to \infty$, we let $z(\infty)$
denote a random variable whose distribution is the same as
the limiting distribution. Given $x \in \mathbb{R}^n$, $x_{\min} = \min_i x_i$ and
$x_{\max} = \max_i x_i$.
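
As a quick illustration of the two distances defined above, the following is a minimal sketch in Python. It assumes the usual 1/2 normalization for total variation and the convention that terms with $\pi_i = 0$ contribute zero to the KL divergence; the function names and the example vectors are hypothetical.

```python
import math

def total_variation(pi, nu):
    """Total variation distance: half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(p - q) for p, q in zip(pi, nu))

def kl_divergence(pi, nu):
    """KL divergence of pi from nu; terms with pi_i = 0 are taken to contribute zero."""
    return sum(p * math.log(p / q) for p, q in zip(pi, nu) if p > 0)

# Hypothetical example vectors (not from the paper).
pi = [0.5, 0.5]
nu = [0.9, 0.1]
tv = total_variation(pi, nu)   # 0.5 * (0.4 + 0.4)
kl = kl_divergence(pi, nu)     # strictly positive since pi != nu
```

Note that the KL divergence is not symmetric in its arguments, whereas the total variation distance is.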

2. SYSTEM MODEL AND DEFINITIONS 

Cloud Cluster Model and Graph-structured Jobs: Consider a
collection of machines $\mathcal{L}$. Each machine $\ell \in \mathcal{L}$ has a set of
slots $m_\ell$ which it can use to run at most $|m_\ell|$ processes in
parallel (see Figure 1). These machines are interconnected
by a communication network. Let $M = \sum_{\ell \in \mathcal{L}} |m_\ell|$ be the total
number of slots in the cluster.

There is a collection of job types $\mathcal{J}$, where each job type
$j \in \mathcal{J}$ is described by a graph $G_j(V_j, E_j)$ consisting of a set
of nodes $V_j$ and a set of edges $E_j$. Each graph $G_j$ represents
how the computation is split among the set of nodes $V_j$.
Nodes correspond to computation, with each node requiring
a slot on some machine; edges represent data flows between
these computations (nodes).


Job Arrivals and Departures: Henceforth, we use the words
job and graph interchangeably. We assume graphs of type
$j$ arrive according to a Poisson process with rate $\lambda_j$, and
remain in the system for an exponentially distributed
amount of time with mean $1/\mu_j$. Each node of the graph must
be assigned to an empty slot on one of the machines. Thus
a graph of type $G_j$ requires a total number of $|V_j|$ free slots
($|V_j| < M$). For each graph, the data center needs to decide how
to partition the graph and distribute it over the machines.

Queueing Dynamics: When jobs arrive, they can either be
immediately served, or queued and served at a later time.
Thus, there is a set of queues $Q(t) = (Q^{(j)}(t) : j \in \mathcal{J})$
representing existing jobs in the system either waiting for service
or receiving service. Queues follow the usual dynamics:

$$Q^{(j)}(t) = Q^{(j)}(0) + H^{(j)}(0,t) - D^{(j)}(0,t), \qquad (1)$$

where $H^{(j)}(0,t)$ and $D^{(j)}(0,t)$ are respectively the number
of jobs of type $j$ arrived up to time $t$ and departed up to
time $t$.

Job Partition Cost: For any job, we assume that the cost of
data exchange between two nodes that are inside the same
machine is zero, and the cost of data exchange between two
nodes of a graph on different machines is one. This models
the cost incurred by the data center due to the total traffic
exchange among different machines. Note that this model
is only for keeping notation simple; in fact, if we make the
cost of each edge different (depending for instance on the
pair of machines on which the nodes are assigned, thus
capturing communication network topology constraints within
the cloud cluster), there is minimal change in our description
below. Specifically, we only need to redefine the appropriate
cost in (2) and the ensuing analysis will remain unchanged.

Templates: An important construct in this paper is the concept
of a template. Observe that for any graph $G_j$, there are
several ways (an exponentially large number) in which it can
be partitioned and distributed over the machines (see Figure 1).
A template corresponds to one possible way in which
a graph $G_j$ can be partitioned and distributed over the machines.
Rigorously, a template $A$ for graph $G_j$ is an
injective function $A : V_j \to \bigcup_{\ell \in \mathcal{L}} m_\ell$ which maps each node
of $G_j$ to a unique slot in one of the machines. We use $\mathcal{A}^{(j)}$ to
denote the set of all possible templates for graph $G_j$. Tying
back to the cost model, for $A \in \mathcal{A}^{(j)}$, let $b_A^{(j)}$ be the cost of
partitioning $G_j$ according to template $A$; then

$$b_A^{(j)} = \sum_{(x,y) \in E_j} \mathbb{1}\{A(x) \in m_\ell, A(y) \in m_{\ell'}, \ell \neq \ell'\}. \qquad (2)$$
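
To make the cost in (2) concrete, here is a minimal Python sketch that counts the broken edges of a template. The data layout (a template as a dictionary from nodes to slot indices, and a `slot_machine` dictionary from slots to machines) and the example graph are assumptions made for illustration; they are not part of the model description.

```python
def partition_cost(edges, template, slot_machine):
    """Number of edges whose two endpoints are placed on different machines."""
    return sum(1 for x, y in edges
               if slot_machine[template[x]] != slot_machine[template[y]])

# Hypothetical cluster: 3 machines with 4 slots each; slot s lives on machine s // 4.
slot_machine = {s: s // 4 for s in range(12)}

# Hypothetical five-node path graph 0 - 1 - 2 - 3 - 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

# Nodes 0..3 share machine 0, node 4 sits on machine 1: only edge (3, 4) is broken.
packed = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
# Neighbors always land on different machines: every edge is broken.
spread = {0: 0, 1: 4, 2: 8, 3: 1, 4: 5}

cost_packed = partition_cost(edges, packed, slot_machine)
cost_spread = partition_cost(edges, spread, slot_machine)
```

With unit edge costs, the cost only depends on which machine each node lands on, not on the particular slot within the machine.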


Configuration: While there is an extremely large number
of templates possible for each graph, only a limited number
of templates can be present in the system at any instant of
time. This is because each slot can be used by at most one
template at any given time.

To track the collection of templates in the system, we let
$C^{(j)}(t) \subset \mathcal{A}^{(j)}$ be the set of existing templates of graphs
$G_j$ in the system at time $t$. The system configuration at each
time $t$ is then defined as

$$C(t) = (C^{(j)}(t); j \in \mathcal{J}). \qquad (3)$$
Figure 1: Illustrative templates for partitioning and
distributing a five-node graph in a cluster of 3
servers, with each server having 4 empty slots. In
this stylized example, all the edges have unit “breaking”
costs, i.e., two connected nodes being scheduled
on different servers incurs a unit cost. The cost
of partitioning the graph according to these templates
is as follows: $b_{\text{Template 1}} = 1$, $b_{\text{Template 2}} = 2$,
$b_{\text{Template 3}} = 3$, and $b_{\text{Template 4}} = 4$.


By definition, there is a template in the system corresponding
to each job that is being served on a set of machines.
Further, when a new job arrives or departs, the system
can (potentially) create a new template that is a pattern
of empty slots across machines that can be “filled” with a
specific job type (i.e., one particular graph topology). We
call the former actual templates, and the latter virtual
templates. Further, when a job departs, the system can
potentially destroy the associated template.

The set of all possible configurations is denoted by $\mathcal{C}$. Note
that this collection is a union of the actual and virtual
templates. Mathematically, $C^{(j)} = C_a^{(j)} \cup C_v^{(j)}$, where $C_a^{(j)}$ is the
set of templates that contain actual jobs of type $j$ and $C_v^{(j)}$ is
the set of virtual templates, i.e., templates that are reserved
for jobs of type $j$ but currently do not contain any such jobs.

System State and Updates: Finally, the system state at each
time is given by:

$$S(t) = (Q(t), C(t)). \qquad (4)$$

It is possible that $|C^{(j)}(t)| > Q^{(j)}(t)$, in which case not all
the templates in $C^{(j)}(t)$ are being used for serving jobs; these
unused templates are the virtual templates.

Define the operation $C \oplus A^{(j)}$ as adding a feasible template
$A$ for graphs of type $G_j$ to the configuration $C$; thus $A$ will
be added to $C^{(j)}$ while $C^{(j')}$ remains unchanged for $j' \neq j$.
Define $\mathcal{A}^{(j)}(C)$ as the set of possible templates that can be
used for adding a graph $G_j$ when the configuration is $C$.
Clearly, $A \in \mathcal{A}^{(j)}(C)$ must be an injective function that
maps graph $G_j$ to the available slots that have not been
used by the current templates in the system configuration,
i.e.,

$$A : V_j \to \Big( \bigcup_{\ell \in \mathcal{L}} m_\ell \Big) \setminus \Big( \bigcup_{j' \in \mathcal{J}} \bigcup_{A' \in C^{(j')}} A'(V_{j'}) \Big).$$
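
The feasibility condition above can be sketched in a few lines of Python. The representation is an assumption made for illustration: a configuration is a dictionary from job type to a list of templates, and each template is a dictionary from nodes to slot indices. The helper names `free_slots` and `is_feasible` are hypothetical.

```python
def free_slots(all_slots, configuration):
    """Slots not used by any template (actual or virtual) in the configuration."""
    used = {slot
            for templates in configuration.values()
            for template in templates
            for slot in template.values()}
    return set(all_slots) - used

def is_feasible(template, all_slots, configuration):
    """A new template is feasible if it is injective and uses only free slots."""
    slots = list(template.values())
    injective = len(slots) == len(set(slots))
    return injective and set(slots) <= free_slots(all_slots, configuration)

# Hypothetical example: 6 slots, two job types already holding slots 0, 1 and 4.
all_slots = range(6)
configuration = {"j1": [{"u": 0, "v": 1}], "j2": [{"x": 4}]}
available = free_slots(all_slots, configuration)
```

Because virtual templates also reserve slots, they are counted as used here exactly like actual templates.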



3. PROBLEM FORMULATION 

































Given any stationary (and Markov) algorithm for scheduling
arriving graphs, the system state evolves as an irreducible
and aperiodic Markov chain. Our goal is to minimize
the average partitioning cost, i.e.,

$$\text{minimize} \quad \mathbb{E}\Big[ \sum_{j \in \mathcal{J}} \sum_{A \in \mathcal{A}^{(j)}} x_A(\infty) b_A^{(j)} \Big] \qquad (5)$$
$$\text{subject to system stability}$$

where $x_A(\infty)$ is a random variable denoting the fraction of
time that a template $A$ is used in steady state. The system
stability in (5) means that the average delay (or average
queue size) remains bounded. There is an inherent tradeoff
between the average delay and the average partitioning cost.
For more lenient delay constraints, the algorithm can defer
the scheduling of jobs further until a feasible template with
low partitioning cost becomes available.

Throughout the paper, let $\rho_j = \lambda_j/\mu_j$ be the load of
graphs of type $G_j$.

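Definition 1 (Capacity Region). The capacity region
of the system is defined as

$$\Lambda = \Big\{ z \in \mathbb{R}_+^{|\mathcal{J}|} : \exists \pi \text{ s.t. } z_j = \sum_{C \in \mathcal{C}} \pi(C) |C^{(j)}|,\ j \in \mathcal{J};\ \sum_{C \in \mathcal{C}} \pi(C) = 1,\ \pi(C) \geq 0 \Big\},$$

where $|C^{(j)}|$ denotes the number of templates of graph $G_j$ in
configuration $C$.

By the definition, any load vector $z \in \Lambda$ can be supported
by a proper time-sharing among the configurations, according
to $\pi$. Equivalently, for any $z \in \Lambda$, there exists an
$x = [x_A : A \in \cup_j \mathcal{A}^{(j)}]$ such that

$$z_j = \sum_{A \in \mathcal{A}^{(j)}} x_A; \quad j \in \mathcal{J},$$

where $x_A$ is the average fraction of time that template $A$ is
used, given by

$$x_A = \sum_{C \in \mathcal{C}} \pi(C) \mathbb{1}(A \in C^{(j)}); \quad A \in \mathcal{A}^{(j)},\ j \in \mathcal{J}.$$

It follows from standard arguments that for loads outside
$\Lambda$, there is no algorithm that can keep the queues stable.
Given the loads $\rho = [\rho_j : j \in \mathcal{J}]$, we define an associated
static problem.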

Definition 2 (Static Partitioning Problem).

$$\min_{x} \quad G(x) := \sum_{j \in \mathcal{J}} \sum_{A \in \mathcal{A}^{(j)}} x_A b_A^{(j)}$$
$$\text{subject to:} \quad \sum_{A \in \mathcal{A}^{(j)}} x_A \geq \rho_j; \quad j \in \mathcal{J} \qquad (7)$$
$$x_A \geq 0; \quad A \in \cup_j \mathcal{A}^{(j)} \qquad (8)$$
$$\Big[ \sum_{A \in \mathcal{A}^{(j)}} x_A;\ j \in \mathcal{J} \Big] \in \Lambda \qquad (9)$$

The constraints (7)-(9) are the required stability conditions.

In words, given $\rho_j$ graphs of type $G_j$, for $j \in \mathcal{J}$, the static
partitioning problem is to determine how to partition and
distribute the graphs over the servers so as to minimize the
total partitioning cost. For the set of supportable loads
($\rho \in \Lambda$), the static problem is feasible and has a finite optimal
value.

If the loads $\rho_j$ are known, one can solve the static
partitioning problem and subsequently find the fraction of time
$\pi(C)$ that each configuration $C$ is used. However, the static
partitioning problem is a hard combinatorial problem to
solve.
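
To give a feel for why enumeration is hopeless, the following Python sketch counts the templates of a single graph (injective maps from $|V_j|$ nodes into $M$ slots) and brute-forces the cheapest template on a toy instance. This is only an illustration under assumed data (a hypothetical 2-machine cluster and a triangle graph); it does not solve the static partitioning problem itself, which optimizes over time fractions $x_A$ for all job types jointly.

```python
import itertools
import math

def count_templates(num_slots, num_nodes):
    """Number of injective maps from nodes to slots: M! / (M - |V|)!."""
    return math.perm(num_slots, num_nodes)

def min_cost_template(nodes, edges, slot_machine):
    """Exhaustively search all templates of one graph for the cheapest one."""
    best_cost, best_template = None, None
    for slots in itertools.permutations(slot_machine, len(nodes)):
        template = dict(zip(nodes, slots))
        cost = sum(1 for x, y in edges
                   if slot_machine[template[x]] != slot_machine[template[y]])
        if best_cost is None or cost < best_cost:
            best_cost, best_template = cost, template
    return best_cost, best_template

# Hypothetical toy cluster: 2 machines with 2 slots each.
slot_machine = {0: 0, 1: 0, 2: 1, 3: 1}
triangle_nodes = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]

num_small = count_templates(4, 3)    # small enough to enumerate
num_figure = count_templates(12, 5)  # five-node graph on 3 machines x 4 slots
best_cost, _ = min_cost_template(triangle_nodes, triangle_edges, slot_machine)
```

Even for a single five-node graph on 12 slots there are already tens of thousands of templates, and configurations combine templates of many graphs, so the space grows much faster still.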

In the next sections, we will describe two approaches to
solve the dynamic problem (5) that could converge to the
optimal solution of the static partitioning problem, at the
expense of growth in delay. The inherent tradeoff between
delay and partitioning cost can be tuned in the algorithms.
First, we describe a high complexity frame-based algorithm
(based on traditional Max Weight resource allocation). Then,
we proceed to propose our low complexity algorithm, which
is the main contribution of this paper.

4. HIGH COMPLEXITY FRAME-BASED ALGORITHM

The first candidate for solving the dynamic graph partitioning
problem is to use a Max Weight-type algorithm, with
a proper choice of weight for each configuration. However,
changing the configuration of the system can potentially
interrupt a significant number of ongoing services of the jobs
in the system. Such service interruptions are operationally
very expensive as they incur additional delay to the service
or require the storage and transfer of the state of interrupted
jobs for future recovery. Hence, to reduce the cost of service
interruptions, one can reduce the frequency of configuration
updates. In particular, we describe a Frame-Based algorithm
which updates the configuration once every $T$ time
units. As expected, a smaller value of $T$ could improve the
delay and the partitioning cost of the algorithm at the expense
of more service interruptions. The description of the
algorithm is as follows.


Algorithm 1 Frame-Based Algorithm
1: The configuration is changed at the epochs of cycles of
length $T$. At the epoch of the $k$-th cycle, $k = 0, 1, \cdots$,
choose a configuration $C^*(kT)$ that solves

$$\max_{C \in \mathcal{C}} \sum_{j \in \mathcal{J}} \sum_{A \in C^{(j)}} \Big( \alpha f(Q^{(j)}(kT)) - b_A^{(j)} \Big). \qquad (10)$$

If there is more than one optimal configuration, one of
them is chosen arbitrarily at random. $\alpha > 0$ is a fixed
parameter and $f$ is a concave increasing function.

2: The configuration $C^*(kT)$ is kept fixed over the interval
$[kT, (k+1)T)$ during which jobs are fetched from the
queues and are placed in the available templates. It is
possible that at some time, no jobs of type $j$ are waiting
to get service, in which case some of the templates in
$C^{*(j)}(kT)$ might not be filled with actual jobs. These
are virtual templates which act as placeholders (tokens)
for future arrivals.


The algorithm essentially needs to find a maximum weight
configuration at the epochs of cycles, where the weight of
template $A$ for partitioning graph $G_j$ is

$$w_A^{(j)}(t) = \alpha f(Q^{(j)}(t)) - b_A^{(j)}.$$

The parameter $\alpha$ controls the tradeoff between the queue
size and the partitioning cost of the algorithm. For small
values of $\alpha$, the algorithm defers deploying the graphs in
favor of finding templates with smaller partitioning cost. For
larger values of $\alpha$, the algorithm gives a higher priority to
deployment of job types with large queue sizes.
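
The effect of $\alpha$ on the template weight $\alpha f(Q^{(j)}(t)) - b_A^{(j)}$ can be seen in a small numerical sketch. The choice $f(x) = \log(1+x)$ and the numbers below are assumptions made purely for illustration; the text only requires $f$ to be concave and increasing.

```python
import math

def template_weight(alpha, queue_size, cost, f=math.log1p):
    """Weight alpha * f(Q) - b of a template; f = log(1 + x) is an assumed choice."""
    return alpha * f(queue_size) - cost

# Hypothetical state: 100 queued jobs, a template that breaks 2 edges.
queue_size, cost = 100, 2

# Small alpha: the cost term dominates, so deploying this template is unattractive.
w_small_alpha = template_weight(0.1, queue_size, cost)
# Large alpha: the queue term dominates, so the same template becomes attractive.
w_large_alpha = template_weight(10.0, queue_size, cost)
```

For any fixed $\alpha$ and queue size, a template with a lower partitioning cost always has the larger weight, so $\alpha$ only changes how much queue backlog is needed before a costly template is worth deploying.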

The optimization (10) is a hard combinatorial problem, as
the size of the configuration space $\mathcal{C}$ might be exponentially
large, thus hindering efficient computation of the max weight
configuration. Theorem 1 below characterizes the inherent
tradeoff between the average queue size and the average
partitioning cost.

Theorem 1. Suppose $\rho(1+\delta^*) \in \Lambda$ for some $\delta^* > 0$. The
average queue size and the average partitioning cost under
the Frame-Based algorithm are

$$\mathbb{E}\Big[ \sum_{j \in \mathcal{J}} f(Q^{(j)}(\infty)) \Big] \leq \frac{(B_1 + B_2 T) + (1+\delta^*) G(x^*)/\alpha}{\rho_{\min} \delta^*},$$
$$\mathbb{E}\Big[ \sum_{j \in \mathcal{J}} \sum_{A \in \mathcal{A}^{(j)}} x_A(\infty) b_A^{(j)} \Big] \leq G(x^*) + \alpha (B_1 + B_2 T),$$

where $x^*$ is the optimal solution to the static partitioning
problem, $\rho_{\min} = \min_j \rho_j$, and $B_1, B_2$ are constants.

Hence, as $\alpha \to 0$, the algorithm yields an $\alpha$-optimal
partitioning cost, and an $O(f^{-1}(1/\alpha))$ queue size. Also, as
expected, infrequent configuration updates could increase the
delay and partitioning cost by multiples of $T$. The proof of
Theorem 1 follows from standard Lyapunov arguments and
can be found in the appendix.


5. LOW COMPLEXITY ALGORITHMS WITHOUT SERVICE INTERRUPTIONS

In this section, we develop a low complexity algorithm
that can be used to solve (5) without interrupting/migrating
the ongoing services. Before describing the algorithm, we
first introduce a (modified) weight for each template. Given
the vector of queue sizes $Q(t)$, and a concave increasing
function $f : \mathbb{R}_+ \to \mathbb{R}_+$, the weight of template $A \in \mathcal{A}^{(j)}$,
$j \in \mathcal{J}$, is defined as

$$w_A^{(j)}(t) = \alpha f^{(j)}(\mathbf{h} + Q(t)) - b_A^{(j)}, \qquad (11)$$

where $f^{(j)} : \mathbb{R}_+^{|\mathcal{J}|} \to \mathbb{R}_+$ is

$$f^{(j)}(x) = \max\{ f(x_j), \ldots \}, \quad x_{\max} = \max_j x_j, \qquad (12)$$

where $\mathbf{h} = h \mathbf{1}_{|\mathcal{J}|}$, and $\alpha, h \in \mathbb{R}_+$, and $\epsilon \in (0,1)$ are the
parameters of the algorithm.

At the instances of job arrivals and departures, the algorithm
makes decisions on the templates that are added
to/removed from the system configuration. It is important
that the addition/removal of templates by the algorithm
does not disrupt the ongoing service of existing jobs in the
configuration.

The low complexity algorithm is a randomized algorithm
in which the candidate template to be added to the configuration
is chosen randomly among the set of feasible templates.
In particular, the following Random Partition Procedure
is used as a subroutine in our low complexity algorithm.


Algorithm 2 Random Partition Procedure

Input: current configuration $C$, and a graph $G(V,E)$, $V = \{v_1, \cdots, v_{|V|}\}$

Output: a virtual template $A$ for distributing $G$ over the
machines.

1: $k \leftarrow 1$
2: slot-available $\leftarrow 1$
3: while $k \leq |V|$ and slot-available do
4:   if there are no free slots available on any of the machines then
5:     slot-available $\leftarrow 0$; $A \leftarrow \emptyset$
6:   else
7:     place $v_k$ uniformly at random in one of the free slots
8:     $A(v_k)$ = index of the slot containing $v_k$
9:     $k \leftarrow k + 1$
10:  end if
11: end while
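
The Random Partition Procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes the free slots have already been extracted from the current configuration and are passed in directly, it represents a template as a dictionary from nodes to slots, and it returns `None` in place of the empty template when the free slots run out. The function name and the `rng` parameter are hypothetical.

```python
import random

def random_partition(free_slots, nodes, rng=random):
    """Place each node in a distinct free slot chosen uniformly at random.

    Returns a dict node -> slot, or None if the free slots run out
    before every node has been placed.
    """
    available = list(free_slots)
    template = {}
    for v in nodes:
        if not available:
            return None  # no free slot left: no template is produced
        i = rng.randrange(len(available))
        # Swap the chosen slot to the end and pop it, so it cannot be reused.
        available[i], available[-1] = available[-1], available[i]
        template[v] = available.pop()
    return template

# Hypothetical usage: a five-node graph on a cluster with 12 free slots.
template = random_partition(range(12), ["v1", "v2", "v3", "v4", "v5"])
```

The procedure runs in time linear in the number of nodes and never inspects the edges, which is what keeps the per-event complexity low; the partitioning cost of the resulting template only enters later, through the acceptance decision.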


When a random template is generated according to the Random
Partition Procedure, the decision to keep or remove the
template is made probabilistically based on the weight of the
template. The description of the low complexity algorithm
(called the Dynamic Graph Partitioning (DGP) algorithm) is as
follows. In the description, $\beta > 0$ is a fixed parameter.


Algorithm 3 Dynamic Graph Partitioning (DGP)

Arrival instances. Suppose a graph (job) $G_j$ arrives at
time $t$, then:

1: This job is added to queue $Q^{(j)}$.

2: A virtual graph $G_j$ is randomly distributed over the machines,
if possible, using the Random Partition Procedure,
which creates a virtual template $A^{(j)}$ for distributing
a graph $G_j$ over the machines with some partitioning
cost $b_A^{(j)}$. Then, this virtual template is added to the
current configuration with probability
$\frac{\exp(\frac{1}{\beta} w_A^{(j)}(t^+))}{1 + \exp(\frac{1}{\beta} w_A^{(j)}(t^+))}$;
otherwise, it is discarded and the configuration does not
change. The virtual templates of type $j$ leave the system
after an exponentially distributed time duration with
mean $1/\mu_j$.

3: If there is one or more virtual templates available for
accommodating graphs of type $G_j$, a job from $Q^{(j)}$ (e.g.,
the head-of-the-line job) is placed in one of the virtual
templates chosen arbitrarily at random. This converts
the virtual template to an actual template.

Departure instances. Suppose a departure of a
(virtual or actual) template $A^{(j)}$ occurs at time $t$,
then:

1: If this is an actual template, the job departs and queue
$Q^{(j)}$ is updated.

2: A virtual template of the same type $A^{(j)}$ is added back
to the configuration with probability
$\frac{\exp(\frac{1}{\beta} w_A^{(j)}(t^+))}{1 + \exp(\frac{1}{\beta} w_A^{(j)}(t^+))}$.

3: If a virtual template for accommodating a graph $G_j$ is
available in the system, and there are jobs in $Q^{(j)}$
waiting to get service, a job from $Q^{(j)}$ (e.g., the
head-of-the-line job) is placed in one of the virtual templates
chosen arbitrarily at random. This converts the virtual
template to an actual template.
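
The acceptance probability used in both the arrival and departure steps is a logistic function of the template weight scaled by $1/\beta$. A minimal Python sketch is given below; the function names are hypothetical, and the two-branch evaluation is an implementation choice (not from the paper) that avoids floating-point overflow when the ratio of weight to $\beta$ is large in magnitude.

```python
import math
import random

def acceptance_probability(weight, beta):
    """exp(w / beta) / (1 + exp(w / beta)), evaluated without overflow."""
    z = weight / beta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def accept_template(weight, beta, rng=random):
    """Randomized keep/discard decision for a candidate virtual template."""
    return rng.random() < acceptance_probability(weight, beta)

# Hypothetical weights: the same positive weight under two values of beta.
p_warm = acceptance_probability(0.1, 1.0)    # close to 1/2
p_cold = acceptance_probability(0.1, 0.01)   # close to 1
```

As $\beta$ decreases, the decision becomes nearly deterministic in the sign of the weight: templates with positive weight are almost always kept and those with negative weight almost always discarded.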













To simplify the description, we have assumed that the
system starts from an empty initial configuration and empty
queues, but this is not necessary for the results to hold. We
emphasize that the DGP algorithm does not interrupt the
ongoing services of existing jobs in the system. The following
theorem states our main result regarding the performance of
the algorithm.

Theorem 2. Suppose $\rho(1+\delta^*) \in \Lambda$ for some $0 < \delta^* < 1$.
Consider the Dynamic Graph Partitioning (DGP) algorithm
with function

$$f(x) = \log^{1-b}(x); \quad b \in (0,1),$$

and parameters

a<l3<P, e<S ■ h > exp (^Co-pi-) j, 

where Co is a large constant independent of all these param¬ 
eters. Then the average queue size and the average parti¬ 
tioning cost under the DGP algorithm are 

y e[/(Q^(00))1 < K^- I log ^min + 

^ L J PminO* V a 

i(l + 572)G(x*) + - 6 ^a.'), 

a a / 

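$$\mathbb{E}\Big[ \sum_{j \in \mathcal{J}} \sum_{A \in \mathcal{A}^{(j)}} x_A(\infty) b_A^{(j)} \Big] \leq G(x^*) + \alpha (K_2 + K_3) - \beta \log \gamma_{\min} + \epsilon b_{\max},$$

where $x^*$ is the optimal solution to the static partitioning
problem, $\rho_{\min} = \min_j \rho_j$, $K_2 \leq f'(h)(M + \sum_j \rho_j)$, $K_3 \leq
f(M+h)M$, and $\gamma_{\min}$ and $b_{\max}$ are constants.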

We would like to point out that in the above theorem the
bounds are explicit for any choices of $\alpha, \beta, \epsilon, h$. The constant
$\gamma_{\min}$ is $\min_C \gamma_C$ for a distribution $\gamma$ to be defined in (14) and
has a loss-system interpretation (see Step 1 in the Proof of
Theorem 2), and $b_{\max}$ is the maximum partitioning cost of
any job type (which is obviously less than $M^2$).

The parameter $h$ is called the bias and adds an offset to
the queues to ensure the algorithm operates near the optimal
point at (effectively) all times. The parameter $\beta$ has a
role similar to that of the temperature in the Gibbs sampler. As
$\beta \to 0$, in steady state, the algorithm generates configurations
that are closer to the optimal configuration, however at the
expense of growth in queue sizes. We refer to Section 6 for
the proof and also more insight into the operation of the
algorithm.

The following corollary gives an interpretation of the result
for a particular choice of the parameters.

Corollary 1. Choose a = /3^, h = exp j, 

e = then as P ^ 0, 

yE[/(Q7oo))] <e((i)"), 

e[E E 2 ;a(oo)&^^] < G(a;*)-be(y/^). 

jej AeA(2'i 

The corollary above demonstrates how the choice of $\beta$ controls
the tradeoff between approaching the optimal partitioning
cost and the queueing performance.


Remark 1. Comparison with CSMA: In the setting of
scheduling in wireless networks, the Gibbs sampler has been
used to design CSMA-like algorithms for stabilizing the network
[17, ?, 23, ?]. Our algorithm is different from this line
of work in three fundamental aspects:

(i) Not relying on Gibbs sampler: Unlike wireless networks
where the solutions form independent sets of a graph, there
is no natural graph structure analog in the graph partitioning.
The Gibbs sampler (and CSMA) can still be used in our
setting by sampling partitions of graphs, where each site of
the Gibbs sampler is a unique way of partitioning and packing
a graph among the machines. The difficulty, however, is
that there is an exponentially large number of graph partitions,
leading to a correspondingly large number of queues
for each type of graph. A novelty of our algorithm is that we
only need to maintain one queue for each type of graph. This
substantial reduction is achieved by using the Random Partition
Procedure for exploring the space of solutions. This leads to
minimizing a modified energy function instead of the Gibbs
energy; specifically, the entropy term in the Gibbs energy is
replaced with the relative entropy with respect to a probability
distribution that arises in an associated loss system
(see Step 1 in Section 6).

(ii) No service interruptions: Our low complexity algorithm
performs updates at appropriate time instances without
causing service interruptions.

(iii) Adding bias to the queues: The queue-based CSMA
algorithms are concerned with stability, which pertains to
the behavior of the algorithm for large queue sizes. This
is not sufficient in our setting because we are not only
concerned with stability, but more importantly with the optimal
(graph partitioning) cost of the system. The bias $h$ boosts
the queue sizes artificially to ensure that the system operates
effectively near the optimal point at all queue sizes. Without
the bias, when the queue sizes are small, the
cost of the algorithm could be far from optimal.

Remark 2. An Alternative Algorithm: An alternative description
of the algorithm is possible using a dedicated Poisson
clock for each queue (independent of arrivals) where the
template decisions are made at the ticks of the dedicated
clocks. We have presented this alternative algorithm in the
appendix.


6. PROOFS 


In this section, we present the proof of Theorem 2.
Before describing the proof outline, we make the following
definition.

Definition: DGP(W). Consider the dynamic graph partitioning
algorithm with fixed weights
$W = [w_A^{(j)};\ A \in \mathcal{A}^{(j)},\ j \in \mathcal{J}]$,
namely, when the weights are not chosen according to (11)
but are simply some fixed numbers at all times. With
minor abuse of notation, we use DGP(W) to denote this
algorithm that uses the weights W at all times. The description
of DGP(W) is exactly the same as the dynamic partitioning
algorithm, except that at an arrival/departure instance at
time t, the decision to add/keep a virtual template $A^{(j)}$ is
made according to the probability

$$\frac{\exp\big(\frac{1}{\beta} w_A^{(j)}\big)}{1+\exp\big(\frac{1}{\beta} w_A^{(j)}\big)},$$

independently of Q(t).
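The acceptance rule in this definition is a logistic function of the scaled weight. The following is a minimal sketch, not the authors' implementation, of how such a probability can be evaluated without numerical overflow; the function name `accept_probability` and its argument names are illustrative assumptions.

```python
import math


def accept_probability(weight: float, beta: float) -> float:
    """Return exp(w / beta) / (1 + exp(w / beta)) for a fixed template weight w.

    The name and signature are hypothetical; only the formula comes from the text.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    z = weight / beta
    # Branch on the sign of z so that math.exp is only called on
    # non-positive arguments and therefore never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

For a fixed weight, a smaller value of `beta` pushes the probability toward 0 or 1, which is the sense in which small `beta` favors high-weight templates.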


Proof Outline. The proof of Theorem 2 has three steps:

Step 1: We analyze the steady-state distribution of configurations
under DGP(W) with fixed weights W, and show
that for small values of $\beta$, DGP(W) will generate
configurations which are “close” to the max weight
configuration, when the template weights are per W.

Step 2: We show that when the weights are chosen according to
(11), although the weights W(t) are time-varying,
the distribution of configurations in the system will be
“close” to the corresponding steady-state distribution
of DGP(W(t)), for all times t large enough. We show
that such a “time-scale decomposition” holds under a
suitable choice of the bias h and the function f.

Step 3: Finally, we stitch the dynamics of queues and
configurations together through the Lyapunov optimization
method to compute the queueing and partitioning cost
of our algorithm.


Step 1: Steady-State Analysis of DGP(W)

Under DGP(W), the configuration of the system evolves as
a “time-homogeneous” Markov chain over the state space $\mathcal{C}$.
Note that from the perspective of the evolution of the configuration
in the system, we do not need to distinguish between virtual
and actual templates, since the transition rates from any
configuration C do not depend on whether the templates in C are
actual or virtual. To see this, consider any virtual template
$A^{(j)}$ of graphs $G_j$ in C(t). Regardless of whether the virtual
template is filled with an actual job or not, the residual time until the
departure of this template is still exponential with rate $\mu_j$,
due to the memoryless property of the exponential distribution
and because both virtual templates and jobs have exponential
service times with the same mean $1/\mu_j$. The following
proposition states the main property of DGP(W).


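Proposition 1. Consider DGP(W) with fixed weights
$W = [w_A^{(j)};\ A \in \mathcal{A}^{(j)},\ j \in \mathcal{J}]$. Then in steady state, the
distribution of configurations $\pi$ solves the following
optimization problem

$$\max_{\{\pi:\ \sum_{C\in\mathcal{C}}\pi(C)=1\}}\ \mathbb{E}_{\pi}\Big[\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big] - \beta D_{KL}(\pi\,\|\,\gamma), \qquad (13)$$

where $D_{KL}(\cdot\,\|\,\cdot)$ is the KL divergence of $\pi$ from the
probability distribution $\gamma$, where

$$\gamma_C = \frac{1}{Z_\gamma}\Big(M-\sum_{j}|C^{(j)}|\,|V_j|\Big)!\ \prod_{j}\rho_j^{|C^{(j)}|},\quad C\in\mathcal{C}, \qquad (14)$$

and $Z_\gamma$ is the normalizing constant.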

Before describing the proof of Proposition 1, we briefly
highlight the main features of the DGP(W) algorithm:

(i) The algorithm does not interrupt the ongoing services
of existing jobs in the system and does not require
dedicated computing resources.

(ii) The algorithm is different from the Gibbs sampler as it
does not maximize the Gibbs energy. The entropy
term $H(\pi)$ in the Gibbs energy has been replaced by
the relative entropy $D_{KL}(\pi\,\|\,\gamma)$.

(iii) The distribution $\gamma$ has the interpretation of the
steady-state distribution of configurations in an associated loss
system defined as follows: at arrival instances, the
arriving graph is randomly distributed over the machines
if possible (according to the Random Partition Procedure),
otherwise it is dropped; at the departure instances, the
job (and hence its template) leaves the system.

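Proof of Proposition 1. Consider the maximization problem

$$\max_{\pi}\ F^{(\beta)}(\pi)$$
subject to $\sum_{C\in\mathcal{C}}\pi(C) = 1$,
$\pi(C) \ge 0,\ \forall C \in \mathcal{C}$,

with the function $F^{(\beta)}(\pi)$ as in (13), which is

$$F^{(\beta)}(\pi) = \sum_{C}\pi(C)\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)} - \beta\sum_{C}\pi(C)\log\pi(C) + \beta\sum_{C}\pi(C)\log\gamma(C).$$

Notice that $F^{(\beta)}(\pi)$ is strictly concave in $\pi$. The Lagrangian
is given by $L(\pi,\eta) = F^{(\beta)}(\pi) + \eta\big(\sum_{C}\pi(C) - 1\big)$, where $\eta \in \mathbb{R}$
is the Lagrange multiplier. Taking $\partial L/\partial\pi(C) = 0$ yields

$$\pi(C) = \exp\Big(-1+\frac{\eta}{\beta}\Big)\,\gamma(C)\exp\Big(\frac{1}{\beta}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big),\quad C\in\mathcal{C},$$

which is automatically nonnegative for any $\eta$. Hence, by the
KKT conditions, $(\pi^*,\eta^*)$ is the optimal primal-dual pair if
it satisfies $\sum_{C}\pi^*(C) = 1$. Thus the optimal distribution $\pi^*$
is

$$\pi^*(C) = \frac{1}{Z_\beta}\,\gamma(C)\exp\Big(\frac{1}{\beta}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big), \qquad (15)$$

where $Z_\beta$ is the normalizing constant.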
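The closed form (15) can be checked numerically on a small instance. The sketch below is illustrative only: the four "configurations", their total weights, the reference distribution `gamma`, and the value of `beta` are all made-up toy values. It compares the objective of (13) at the distribution from (15) against randomly drawn distributions.

```python
import math
import random


def optimal_distribution(weights, gamma, beta):
    """pi*(C) proportional to gamma(C) * exp(weight(C) / beta), as in (15)."""
    unnorm = [g * math.exp(w / beta) for w, g in zip(weights, gamma)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def objective(pi, weights, gamma, beta):
    """F(pi) = E_pi[weight] - beta * KL(pi || gamma), as in (13)."""
    mean_weight = sum(p * w for p, w in zip(pi, weights))
    kl = sum(p * math.log(p / g) for p, g in zip(pi, gamma) if p > 0)
    return mean_weight - beta * kl


# Toy instance (all numbers are assumptions chosen for illustration).
weights = [0.0, 1.0, 2.5, 4.0]   # total template weight of each configuration
gamma = [0.4, 0.3, 0.2, 0.1]     # reference distribution over configurations
beta = 0.7

pi_star = optimal_distribution(weights, gamma, beta)
best = objective(pi_star, weights, gamma, beta)

rng = random.Random(0)


def random_distribution(n):
    """Draw an arbitrary probability vector of length n."""
    draws = [rng.random() for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


# Smallest margin by which pi_star beats 1000 random competitors.
worst_gap = min(
    best - objective(random_distribution(4), weights, gamma, beta)
    for _ in range(1000)
)
```

A useful side identity is that the optimal value equals `beta * log(Z_beta)`, which follows by substituting (15) into the objective.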

Next we show that the DGP(W) algorithm indeed produces
the steady-state distribution (15) with the choice of $\gamma$
in (14), by checking the detailed balance equations. Consider
a template $A^{(j)}$ for graphs of type $G_j$. The detailed
balance equation for the pair C and $C \oplus A^{(j)}$, such that
$C \oplus A^{(j)} \in \mathcal{C}$, is given by

$$\pi(C\oplus A^{(j)})\,\mu_j\,\frac{1}{1+e^{\frac{1}{\beta}w_A^{(j)}}} = \pi(C)\,\frac{\lambda_j}{|\mathcal{A}^{(j)}(C)|}\,\frac{e^{\frac{1}{\beta}w_A^{(j)}}}{1+e^{\frac{1}{\beta}w_A^{(j)}}}.$$

The left-hand side is the departure rate of the (virtual or actual)
template $A^{(j)}$ from the configuration $C \oplus A^{(j)}$. The right-hand
side is the arrival rate of (actual or virtual) graphs $G_j$
to the configuration C that are deployed according to the template
$A^{(j)}$ chosen uniformly at random from $\mathcal{A}^{(j)}(C)$. (Recall
that the Random Partition Procedure used in the algorithm selects
a template $A \in \mathcal{A}^{(j)}(C)$ uniformly at random.) Thus
the detailed balance equation is simply

$$\pi(C\oplus A^{(j)}) = \pi(C)\,\frac{\rho_j}{|\mathcal{A}^{(j)}(C)|}\,e^{\frac{1}{\beta}w_A^{(j)}}.$$

Noting that

$$|\mathcal{A}^{(j)}(C)| = \frac{\big(M-\sum_{j'}|C^{(j')}|\,|V_{j'}|\big)!}{\big(M-\sum_{j'}|C^{(j')}|\,|V_{j'}|-|V_j|\big)!}, \qquad (16)$$

it is then easy to see that (15), with $\gamma$ as in (14), indeed
satisfies the detailed balance equations, and the normalizing
condition that $\sum_{C}\pi(C) = 1$. This concludes the proof. □
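The detailed balance argument can be verified by brute force on a toy cluster. The sketch below rests on several assumptions that are not taken from the paper: three slots, two graph types with one and two nodes, one node per slot, a template modelled as an ordered tuple of distinct free slots, arbitrary fixed weights, and a departing template that is actually removed with probability one minus the add/keep probability. Under those assumptions it checks that $\gamma(C)\exp(\text{weight}(C)/\beta)$ balances every add/remove pair.

```python
import math
from itertools import permutations

M = 3                      # total number of slots (toy value)
NODES = {0: 1, 1: 2}       # hypothetical number of nodes per graph type
LAM = {0: 0.8, 1: 0.5}     # hypothetical arrival rates
MU = {0: 1.0, 1: 2.0}      # hypothetical service rates
BETA = 0.6


def weight(j, slots):
    """Arbitrary fixed template weight, for illustration only."""
    return 1.0 + 0.3 * j - 0.2 * sum(slots)


def free_slots(config):
    used = {s for _, slots in config for s in slots}
    return [s for s in range(M) if s not in used]


def feasible_templates(config, j):
    """All ordered placements of the nodes of type j onto free slots."""
    return list(permutations(free_slots(config), NODES[j]))


def all_configurations():
    """Enumerate every configuration reachable from the empty one."""
    seen = {frozenset()}
    frontier = [frozenset()]
    while frontier:
        config = frontier.pop()
        for j in NODES:
            for slots in feasible_templates(config, j):
                bigger = config | {(j, slots)}
                if bigger not in seen:
                    seen.add(bigger)
                    frontier.append(bigger)
    return seen


def pi_unnormalized(config):
    """gamma(C) * exp(weight(C) / beta), without the normalizing constant."""
    value = float(math.factorial(len(free_slots(config))))
    for j, _ in config:
        value *= LAM[j] / MU[j]
    total_weight = sum(weight(j, slots) for j, slots in config)
    return value * math.exp(total_weight / BETA)


def max_detailed_balance_gap():
    """Largest relative mismatch between the two sides of detailed balance."""
    gap = 0.0
    for config in all_configurations():
        for j in NODES:
            templates = feasible_templates(config, j)
            for slots in templates:
                p = 1.0 / (1.0 + math.exp(-weight(j, slots) / BETA))
                bigger = config | {(j, slots)}
                lhs = pi_unnormalized(bigger) * MU[j] * (1.0 - p)
                rhs = pi_unnormalized(config) * LAM[j] / len(templates) * p
                gap = max(gap, abs(lhs - rhs) / rhs)
    return gap
```

Calling `max_detailed_balance_gap()` returns a value at the level of floating-point rounding, which is consistent with the product form of the stationary distribution in this toy model.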


The parameter $\beta$ plays a role similar to the temperature in
the Gibbs sampler. As $\beta \to 0$, in steady state, the DGP(W)
algorithm generates configurations that are closer to the
optimal configuration with maximum weight

$$W^* = \max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}. \qquad (17)$$

The following corollary contains this result.

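Corollary 2. Let $F^{(\beta)}(\pi^*)$ be the optimal objective
function in (13). The algorithm DGP(W) is asymptotically
optimal in the sense that as $\beta \to 0$, $F^{(\beta)}(\pi^*) \to W^*$.
Moreover, for any $\beta > 0$,

$$\mathbb{E}_{\pi^*}\Big[\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big] \ge \max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)} + \beta\min_{C\in\mathcal{C}}\log\gamma_C.$$

Proof of Corollary 2. Let $C^*$ be the maximizer in
(17). As a direct consequence of Proposition 1,

$$\mathbb{E}_{\pi^*}\Big[\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big] - \beta D(\pi^*\,\|\,\gamma) \ge W^* - \beta D(\delta_{C^*}\,\|\,\gamma).$$

Since $D(\nu\,\|\,\gamma) \ge 0$ for any distribution $\nu$,

$$\mathbb{E}_{\pi^*}\Big[\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}\Big] \ge W^* - \beta D(\delta_{C^*}\,\|\,\gamma)$$
$$= W^* + \beta\log\gamma_{C^*}$$
$$\ge W^* + \beta\min_{C\in\mathcal{C}}\log\gamma_C.$$

□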
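The bound in Corollary 2 can be illustrated on a small example. The sketch below uses made-up configuration weights and a made-up reference distribution (named `WEIGHTS` and `GAMMA` here, both assumptions) and evaluates the expected weight under the product-form distribution for several values of `beta`, next to the lower bound $W^* + \beta \min_C \log \gamma_C$.

```python
import math


def expected_weight(weights, gamma, beta):
    """Expected weight under pi*(C) proportional to gamma(C) * exp(w(C) / beta)."""
    scores = [w / beta + math.log(g) for w, g in zip(weights, gamma)]
    top = max(scores)  # subtract the maximum so exp() cannot overflow
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return sum(w * u for w, u in zip(weights, unnorm)) / total


def lower_bound(weights, gamma, beta):
    """W* + beta * min_C log(gamma_C), the right-hand side of the corollary."""
    return max(weights) + beta * math.log(min(gamma))


# Toy values, chosen only for illustration.
WEIGHTS = [0.0, 1.0, 2.5, 4.0]
GAMMA = [0.4, 0.3, 0.2, 0.1]

for beta in (2.0, 1.0, 0.5, 0.1, 0.01):
    print(beta, expected_weight(WEIGHTS, GAMMA, beta), lower_bound(WEIGHTS, GAMMA, beta))
```

As `beta` shrinks, both the expected weight and the lower bound approach the maximum weight, which is the asymptotic optimality statement.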


Step 2: Time-Scale Decomposition for DGP(W(t)).

Recall that DGP(W(t)) denotes the algorithm that uses the
weight W(t) at all times $s \ge 0$. With minor abuse of
notation, we use DGP(W(t)) to denote the Dynamic Graph
Partitioning algorithm (Section 5) and its associated
time-inhomogeneous Markov chain over the space of
configurations $\mathcal{C}$. The weights W(t) are time-varying (because of the
queue dynamics); however, the DGP(W(t)) algorithm can
still provide an adequately accurate approximation to the
maximum weight optimization at each time, for proper choices of
the function f and the bias h.

Roughly speaking, for proper choices of f and h,
$f(h + Q^{(j)}(t))$ will change adequately slowly with time such that
a time-scale separation occurs, i.e., the convergence of the Markov
chain DGP(W(t)) to its steady-state distribution $\pi_t$ will
occur at a much faster time-scale than the time-scale of
changes in $f(h + Q^{(j)}(t))$ (and thus in the weights). Hence,
the probability distribution of configurations under DGP(W(t))
will remain “close” to $\pi_t$ (the steady-state distribution
of configurations under DGP(W(t))). The proof of such a
time-scale separation follows from standard arguments in 
e.g., (9|[T^[1]. 

We first uniformize (e.g., [18, 21]) the continuous-time Markov
chain S(t) = (C(t), Q(t)) by using a Poisson clock $N_\xi(t)$ of
rate

i = (18) 

Let S[k] = (C[k], Q[k]) be the corresponding jump chain of
the uniformized chain. Note that S[k] is discrete time and
at each index k, either a graph $G_j$ arrives with probability
$\lambda_j/\xi$, or a (virtual/actual) template of type j leaves the
system with probability $|C^{(j)}[k]|\,\mu_j/\xi$, or S[k] remains unchanged
otherwise. The following proposition states the main
“time-scale decomposition” property with respect to the associated
jump chain (which can be naturally mapped to the original
Markov chain).
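Uniformization itself is a generic construction and can be sketched independently of the scheduling model. The example below is an illustration under assumptions: the three-state generator `Q` is a made-up birth-death chain, and the clock rate is taken as twice the largest exit rate rather than the specific rate in (18). It builds the jump-chain matrix $P = I + Q/\xi$ and exposes the properties used later in the proof: rows sum to one, the chain is lazy, and the stationary distribution is unchanged.

```python
def uniformize(generator, rate):
    """Return P = I + Q / rate for a generator matrix Q given as a list of rows."""
    n = len(generator)
    max_exit = max(-generator[i][i] for i in range(n))
    if rate < max_exit:
        raise ValueError("rate must be at least the largest exit rate")
    return [
        [(1.0 if i == j else 0.0) + generator[i][j] / rate for j in range(n)]
        for i in range(n)
    ]


# Toy birth-death generator (an assumption, not the scheduling chain itself).
Q = [
    [-1.0, 1.0, 0.0],
    [2.0, -3.0, 1.0],
    [0.0, 2.0, -2.0],
]
xi = 2.0 * max(-Q[i][i] for i in range(3))  # twice the largest exit rate
P = uniformize(Q, xi)

# Stationary distribution of Q, obtained from its birth-death balance equations.
pi = [4.0 / 7.0, 2.0 / 7.0, 1.0 / 7.0]
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```

Choosing the clock rate at least twice the largest exit rate is what makes every diagonal entry of `P` at least one half, that is, the jump chain is lazy.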


Proposition 2. Let $\nu_n$ denote the (conditional) probability
distribution of the configuration at index n given the queues
Q[n] under DGP(W(Q[n])). Let $\pi_n$ be the steady-state
distribution of configurations corresponding to DGP(W(Q[n])).
Given any $0 < \epsilon < 1$, and any initial state S[0] = (Q[0], C[0]),
there exists a time $n^* = n^*(\epsilon, \beta, S[0])$ such that for all
$n \ge n^*$, $\|\pi_n - \nu_n\|_{TV} \le \epsilon/16$.


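Corollary 3. Given $0 < \epsilon < 1$, for all $n \ge n^*(\epsilon, \beta, S(0))$,

$$\mathbb{E}_{\nu_n}\Big[\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n)\Big] \ge \Big(1-\frac{\epsilon}{4}\Big)\max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n) + \beta\min_{C\in\mathcal{C}}\log\gamma_C - \frac{\epsilon}{8}b_{\max}.$$

Proof. Consider any $n \ge n^*(\epsilon, \beta, S(0))$. Let
$W^*(n) := \max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}} w_A^{(j)}(n)$. First note that from
Corollary 2, Proposition 2, and the definition of $\|\cdot\|_{TV}$,

$$\mathbb{E}_{\nu_n}\Big[\sum_{j}\sum_{A\in C^{(j)}} w_A^{(j)}(n)\Big] = \mathbb{E}_{\pi_n}\Big[\sum_{j}\sum_{A\in C^{(j)}} w_A^{(j)}(n)\Big] + \sum_{C}\big(\nu_n(C)-\pi_n(C)\big)\sum_{j}\sum_{A\in C^{(j)}} w_A^{(j)}(n)$$
$$\ge W^*(n) + \beta\min_{C}\log\gamma_C - 2\Big(\frac{\epsilon}{16}\Big)W^*(n)$$
$$= \Big(1-\frac{\epsilon}{8}\Big)W^*(n) + \beta\min_{C}\log\gamma_C. \qquad (19)$$

Next, note that by the definition of $\tilde{w}$ (see (11), (12)), for
any $j \in \mathcal{J}$, $A \in \mathcal{A}^{(j)}$,

$$\tilde{w}_A^{(j)}(n) \le w_A^{(j)}(n) \le \tilde{w}_A^{(j)}(n) + \frac{\alpha\epsilon}{8M} f(Q_{\max}(n)+h),$$

hence for any configuration $C \in \mathcal{C}$,

$$0 \le \sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\big(w_A^{(j)}(n) - \tilde{w}_A^{(j)}(n)\big) \le \frac{\alpha\epsilon}{8} f(Q_{\max}(n)+h).$$

Suppose $Q^{(j')}(n) = Q_{\max}(n)$ for some $j' \in \mathcal{J}$. Then for any
$A' \in \mathcal{A}^{(j')}$,

$$\alpha f(Q_{\max}(n)+h) - b_{A'}^{(j')} = \tilde{w}_{A'}^{(j')}(n) \le \max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n).$$

Therefore, it follows that

$$0 \le \sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\big(w_A^{(j)}(n) - \tilde{w}_A^{(j)}(n)\big) \le \frac{\epsilon}{8}\Big(\max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n) + b_{\max}\Big).$$

Let $\tilde{W}^*(n) := \max_{C\in\mathcal{C}}\sum_{j\in\mathcal{J}}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n)$. Using the
above inequality and (19),

$$\mathbb{E}_{\nu_n}\Big[\sum_{j}\sum_{A\in C^{(j)}}\tilde{w}_A^{(j)}(n)\Big] \ge \mathbb{E}_{\nu_n}\Big[\sum_{j}\sum_{A\in C^{(j)}} w_A^{(j)}(n)\Big] - \frac{\epsilon}{8}\tilde{W}^*(n) - \frac{\epsilon}{8}b_{\max}$$
$$\ge \Big(1-\frac{\epsilon}{8}\Big)W^*(n) + \beta\log\gamma_{\min} - \frac{\epsilon}{8}\tilde{W}^*(n) - \frac{\epsilon}{8}b_{\max}$$
$$\ge \Big(1-\frac{\epsilon}{4}\Big)\tilde{W}^*(n) + \beta\log\gamma_{\min} - \frac{\epsilon}{8}b_{\max}.$$

□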


Proof of Proposition 2. Below we give a sketch
of the proof of the “time-scale decomposition” property for
our algorithm.

Let $\Phi^{\xi}$ be the infinitesimal generator of the Markov chain
(C(t)) under DGP(W(Q)), for some vector of queues Q. Let
$P^{\xi} = I + \frac{1}{\xi}\Phi^{\xi}$ denote the corresponding transition
probability matrix of the jump chain (C[n]), obtained by
uniformizing (C(t)) using the Poisson clock $N_\xi(t)$ of rate $\xi$ in
(18). We use $P^{\xi}(C, C')$ to denote the transition probability
from configuration C to configuration C'.

The Markov chain (C[n]) is irreducible, aperiodic, and
reversible, with the unique steady-state distribution $\pi$ in
(15). In this case, it is well known that the convergence to
the steady-state distribution is geometric with a rate equal
to the Second Largest Eigenvalue Modulus (SLEM) of $P^{\xi}$
[^. Further, using the choice of ^ in ( |18[ ), {C[n]) is a lazy 
Markov chain because at each jump index n, the chain will
remain in the same state with probability greater than 1/2.
In this case, for any initial probability distribution $\mu_0$ and
for all $n \ge 0$,

$$\|\mu_0 (P^{\xi})^n - \pi\|_{TV} \le \theta_2^{\,n}\,\frac{1}{2\sqrt{\pi_{\min}}}, \qquad (20)$$

where $\theta_2$ is the second largest eigenvalue of $P^{\xi}$, and
$\pi_{\min} = \min_{C}\pi(C)$. Correspondingly, the mixing time of the chain
(defined as $\inf\{n > 0 : \|\nu(n) - \pi(n)\|_{TV} \le \delta\}$) will be less
than

$$\frac{-\log(2\delta\sqrt{\pi_{\min}})}{1-\theta_2}.$$

Lemma 1 below provides a bound on $\theta_2$ and hence on the
convergence rate of the Markov chain $P^{\xi}$.
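To make the geometric convergence concrete, the following sketch checks the standard bound $\|\mu_0 P^n - \pi\|_{TV} \le \theta_2^n / (2\sqrt{\pi_{\min}})$ for reversible lazy chains on a two-state chain, where the second eigenvalue is available in closed form. The transition probabilities `a` and `b` are arbitrary illustrative values and are unrelated to the scheduling chain.

```python
import math


def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def step(dist, matrix):
    """One step of the chain: returns dist * matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Lazy two-state chain: each state is kept with probability above 1/2.
a, b = 0.1, 0.3
P = [[1.0 - a, a], [b, 1.0 - b]]
pi = [b / (a + b), a / (a + b)]   # stationary distribution
theta2 = 1.0 - a - b              # second eigenvalue of a two-state chain
pi_min = min(pi)


def bound_holds(mu0, steps=30):
    """Check TV(mu0 P^n, pi) <= theta2^n / (2 sqrt(pi_min)) for n = 0..steps."""
    dist = list(mu0)
    for n in range(steps + 1):
        bound = theta2 ** n / (2.0 * math.sqrt(pi_min))
        if tv_distance(dist, pi) > bound + 1e-12:
            return False
        dist = step(dist, P)
    return True
```

For a lazy chain all eigenvalues are nonnegative, so the second largest eigenvalue coincides with the SLEM; this is why laziness is convenient in the argument.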


Lemma 1. Let Ko = . Then, 

^ exp p(M^+ 1 ) (^j(Q^^^ + , ( 21 ) 


Proof of Lemma [TJ It follows from Cheeger’s inequal¬ 
ity that ^ ^he conductance of 

the Markov chain P^. The conductance is further bounded 
from below as 


( 22 ) 


Under DGP(tU), with W = W{Q), 

min P?(C,C') — mi 

Cj^c t ' .e ,■ 


min-A 




^ A \ j ) exp(^w^^„) A 1 

“ ^M\ l + exp{^Wmax) 

> ^j(MiAAj) exp(^b^aa,) 


^M! 1 + exp {^ f{Qmax + h )) 

Note that the steady state distribution of the jump chain 
is still 7r(C) = ^ exp(| T,Aec *a')’ T' defined in 
(^. Then 

tZ'maic )! (p max V 1 ) 


CGC 


,Ma 


j Aec 


< \C\exp{^f{Qmax+h))M\{pmax\/l)^, 

therefore 

TTmin > Kl exp f {Q max h) — 

^ ^max ^ 5 (23) 

/ \ M 

where A'l = Hence 

^ exp ^ ^ ^ {afiQmax -I- h) -I- bmax)'j- 

where Kn = K, ^ □ 


Lemma 2. For any configuration C gC, e < 

e'^’*, where 

+ Qmaxin + 1))) - 1) . (24) 

Proof of Lemma [2j Note that 
7r»+i(G) ^ Zr,{P) T.. /»>(Q(^+i)+fc)-/P’(Q(»)+'i) 

TVniC) + 

It is easy to show that 

< maxe^^^'.-^ecO) /<^>{Q{"+i)+G-/P'>(QW+G 

Z„+i{P) - c 

Let Q*(n) := f~^{-^f{h+Q,naxin)))-h,andde&neQ^^\n) := 
max{Q*(n), Q^-’^n)}. Then, 

E)(Q(n + l) + /i)-E)(Q(n) + /i) 

= + l) + h)- (n) + h) 

<f'{Q(i\n+l) + h-l)\Q‘'^'>{n + l)-Q^^\n)\ 

< f'{Q*{n -I- 1) + h — 1) 

= + 1 ))) - 1 ) 

where we have used the mean value theorem and the facts 
that / is a concave increasing function and at each index n, 
one queue can change at most by one. Therefore, 

^”+l(C') < 2 ^f'(f-p^f(h+Qrr.ax(n + m-l) 

7 Vn{C) ~ 

A similar calculation shows that also 

^ 23 ^f'(f-^(-gj^f(h+Q,^ax(n + m-l) 

7r„+i(C') “ 


T(Pe) > 27r™„ mn Pf(G,G'). 






























This concludes the proof. □ 

Next, we use the following version of the Adiabatic Theorem
from [23] to prove the time-scale decomposition property of
our algorithm.

From (23), and since $\alpha < \beta$,

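Proposition 3. (Adapted from [23]) Suppose

$$\frac{\alpha_n}{1-\theta_2(n+1)} \le \delta'/4 \quad \text{for all } n \ge 0, \qquad (25)$$

for some $\delta' > 0$, where $\theta_2(n+1)$ denotes the second largest
eigenvalue of the transition matrix at index n + 1. Then
$\|\pi_n - \nu_n\|_{TV} \le \delta'$ for all
$n \ge n^*(\beta, \delta', S(0))$, where $n^*$ is the smallest n such that

$$\frac{1}{\sqrt{\pi_{\min}(0)}}\exp\Big(-\sum_{k=0}^{n-1}\big(1-\theta_2(k)\big)^2\Big) \le \delta'. \qquad (26)$$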

In our context, Proposition 3 states that under (25) and (26),
after $n^*$ steps, the distribution of the configurations over
templates will be close to the desired steady-state distribution.
To get some intuition, $\alpha_n$ has the interpretation of the
rate at which weights change, and $1/(1-\theta_2(n+1))$ has the
interpretation of the time taken for the system to reach steady
state after the weights change. Thus, condition (25) ensures
a time-scale decomposition: the weights change slowly compared
to the time-scale that the system takes in order to
respond and “settle down” with these changed weights.

It remains to show that our system indeed satisfies
the conditions of Proposition 3, as we do next, for the choice
of $\delta' = \epsilon/16$. Suppose $f(x) = \log^{1-b}(x)$, for some $0 < b < 1$.
Let $y = f(Q_{\max}(n+1) + h)$. Obviously $f'(x) \le 1/x$, so in
view of equations (24), (21), (25), it suffices to have


2Ma 


P /-Hiiry)-! 


exp 


4M 

^ \ 0 max 


+ ay) 


< 


128^2 


Note that / ^(a;) = exp(a;i-'>). Suppose a < (3. A simple 
calculation shows that it suffices to jointly have 

8 M i_i, 
y> — log 3, 
e 


^MpmaxV - \i^V) < 0 , 


K(e 


AM , 1 , e N J-c ^ 1 

- og 512^2■ 

In summary, the condition ( |25[ ) holds if 

or as a sufficient condition, if 

/ 11 \ 

h > exp (^C'o-(-) j 

for 


(27) 


(28) 


M 

-log(7rmin(0)) < log Al -I- MfimaxfiQmaxiO) + h) + —6„ 
Using Lemma it can be shown that 

n*-l 

E(i-^ 2 (fc))" 

n* —; 




iMp.ma.xSiQma.xW + h) 


Kl 


fc =0 

n* —1 


> yy g-4MM„,ax/(Q™ax{0)-|-h + ri*) 

^0 


k=0 


2 e -4 




> n'''{QmaxiO) + h +n*) 1 °*'’ . 

For h > exp{{8MpLrnax)^^^), it then suffices that 


2 f 4 

Kf 


^TTICIX . 1 /O 

^ n {Qmax{0) + h + n > 


log( “) + ^ log + ^^^^fiQnraxiO) + k) + ^6™. 

which is clearly satisfied by choosing n* = Qmax (0) -I- h for 
h in ( |28[ ) and Co a large enough constant. 

Step 3: Lyapunov Analysis 

The final step of the proof is based on a Lyapunov opti¬ 
mization method [^. We develop the required Lyapunov 
arguments for S(k) = (Q(k), C(k)), i.e., the jump chain of
the uniformized Markov chain. Consider the following
Lyapunov function

$$V(k) = \sum_{j\in\mathcal{J}}\frac{1}{\mu_j}F\big(Q^{(j)}(k)+h\big),$$

where $F(x) = \int^{x} f(\tau)\,d\tau$. Recall that $f(x) = \log^{1-b} x$.
Therefore F is convex, and following the standard one-step
drift analysis


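$$V(k+1) - V(k) \le \sum_{j\in\mathcal{J}}\frac{1}{\mu_j} f\big(Q^{(j)}(k+1)+h\big)\big(Q^{(j)}(k+1)-Q^{(j)}(k)\big)$$
$$= \sum_{j\in\mathcal{J}}\frac{1}{\mu_j} f\big(Q^{(j)}(k)+h\big)\big(Q^{(j)}(k+1)-Q^{(j)}(k)\big)$$
$$+ \sum_{j\in\mathcal{J}}\frac{1}{\mu_j}\Big(f\big(Q^{(j)}(k+1)+h\big) - f\big(Q^{(j)}(k)+h\big)\Big)\big(Q^{(j)}(k+1)-Q^{(j)}(k)\big).$$

By the mean value theorem, and using the fact that f is a
concave increasing function, it follows that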

Co > 8M(8Mbn... -f 2| log ^^1 + 2 + (8M)"/• 
Next, we find n* that satisfies 

n*-l 

E (1 - > “log(^) - l^logi-^minlQ)). 

k=0 


$$\big|f\big(Q^{(j)}(k+1)+h\big) - f\big(Q^{(j)}(k)+h\big)\big| \le f'(h)\,\big|Q^{(j)}(k+1)-Q^{(j)}(k)\big|.$$
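The two facts used in the drift computation, convexity of F and the mean value bound on f, can be sanity-checked numerically. The sketch below assumes $f(x) = \log^{1-b}(x)$ with the illustrative values b = 0.5 and h = 10; only differences of F are needed, so the lower limit of the integral plays no role. The midpoint-rule integrator is an assumption made for illustration.

```python
import math

B = 0.5    # hypothetical exponent parameter, 0 < b < 1
H = 10.0   # hypothetical bias value


def f(x):
    """f(x) = (log x)^(1 - b), defined for x >= 1."""
    return math.log(x) ** (1.0 - B)


def f_prime(x):
    """Derivative of f for x > 1."""
    return (1.0 - B) * math.log(x) ** (-B) / x


def integral_of_f(lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of f from lo to hi.

    If hi < lo the result is the negative of the integral over [hi, lo].
    """
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))


def drift_gap(x, y):
    """f(y) * (y - x) - (F(y) - F(x)); nonnegative when F is convex."""
    return f(y) * (y - x) - integral_of_f(x, y)
```

With queue changes of at most one per jump, `drift_gap(x, x + 1)` and `drift_gap(x, x - 1)` are the two cases that matter, and `abs(f(x + 1) - f(x))` stays below `f_prime(H)` for every `x >= H` because f is concave.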

Recall that C{k) = (^Ca\k),ci'’\k)'j where is the 
set of virtual templates (i.e., the templates that do not con¬ 
tain jobs of type j) and Ca'^ is the set of actual templates. 
























For notational compactness, let Es{j,)[-] = E[-|S(fc)], where 
S{k) is the state of the system at each index k. Then 

Es(k) [V{k + 1) - Vik)] < f'(h) ^ ^ 

J2^fiQ^^\k) + h)[^ - (|C«(fc)| - |C70)(fc)|) ff], 

where we have used the fact that at most one arrival or 
departure can happen at every jump index, i.e., \Q^^\k + 

l)_g(i)(fc)| £ { 0 , 1 }. 

Note that clearly the maximum number of templates of 
any type of jobs that can fit in a configuration is less than 
M (recall that M = Moreover, none of the 

templates of type j will be virtual if more than M jobs of 
type j are available in the system, hence, 

\Ci^\k)\f{h + Q^-^\k)) < \&^\k)\}{h + M). 


Notice that equivalently 

W*{k) = max XA{af{Q^^'’{k)+ h)-by\ 

subject to ( yy XA\j € J”) G A 

Ag^(j) 

xa > 0; VA G 

Let X* be the optimal solution to the static partitioning 
problem. By the feasibility of x*, Pj < X^AgAO) ®Ai for Ml 
j £ J, hence 

y2,pjf{Q‘'^\k) + h)<yy yy x*Af(Q^^\k)+h). (so) 

ieJ jeJ AgA(j) 

Further, by assumption, p is strictly inside A, thus there 
exists a S* such that p(l + 5*) G A. It is easy to show by the 
monotonicity of C (i.e., if C G C, C\A G C, for all A G 
j G J) that, at the optimal solution, the constraint 0 
should in fact hold with equality. Hence 


and therefore, 

lEs(k) [vik + 1) - V(fc)] <K 2 + K 3 

+ -^y 2 fiQ^'\k)+h)(^pj-\c^^\k)\) 


Therefore, 


y2 a;*A(l + <5*) 

AgA(j) 


■■j€j 


G A. 


w*{k)>yy yy ((i+5)x*A)(«/(Q«)(fc) + h)-6«)(3i) 

je-T' AgA<j) 


where K 2 < f {h){M + and K 3 < Mf{h + M)/^. 

Therefore, it follows that 

aEs(k) [V{k + 1) - V{k)] + iEs(k) [E E ^a] < 

jej Aeci^^k) 

a(K 2 + K3) + jy 2 + h) 

E [«/(o“w+'.)-i'lf] 

j£J AgC(j){fc) 

Taking the expectation of both sides with respect to v„ (dis¬ 
tribution of configurations given the queues at n > n*), we 
get 

«Eq(k)[H(fc + l)-H(fc)] +Eq(k)[^ ^ XA{k)bf] 

jej AgA<j) 

< a{K 2 + f E PjfiQ^'^k) + h) 

-7EQ(k)[E E {c^fiQ^^Hk) + h)-by)] 

^ jej Aec^Hk) 

< a{K2 + ^ 3 ) - f logy^n + I E 

^ ^ jdJ 

_ 1(1 _£)W/*(fc)+ (29) 

where the last inequality is based on Corollary where 

IT*(fc) = maxE E {yfiQ^^\k) + h)-by'^ 

ie-T' Aec(i> 


For e < S*, (1 — |)( 1 -|- S*) > 1 for any 5 G [0, <5*]. Then 


using p0| ) and pT] ) in p9| ) 
aEq(k)[nfc + l)-nfc)] +Eq(k)[^ ^ XA{k)b^y] 


jej AgA(j) 


< a(fF2 + ^3) + I ^ ^ x\f{Q^^\k) + h) 

3^^ AeA(j) 




(1 + ^)E E 2;*A(-&A^+a/(Q^^’(fc) + h)^ 


iSJ" A^A^i't 


P , , e. 

^ log'ymm ^ Omax 


= ^(l + |)G(**)-f|E E x*Af{Q‘'^\k) + h) 


^2 


jej Aec(3'>(k) 


- J log 7m™ -I- |femax -f 0(7^2 -f K 3 ). 


(32) 


It follows from this that the Markov chain (Q(k), C(k)) is
positive recurrent as a consequence of the Foster-Lyapunov
theorem, with Lyapunov function $V(\cdot)$. Taking the expectation
of both sides of (32) with respect to Q(k), then
summing over $k = 0, \ldots, N-1$, dividing by N,
and letting $N \to \infty$ yields

iV-l 

hmsup- E E®[/(Q^''(fc) + l^)] 

^ k=0 j&J 

ai{K 2 + K 3 ) - Plog 7m™ -I- efemax + (1 + ^/‘^)G{x*) 


< 


OLpvninb j‘ 2 , 


iv -1 

limsup— ^E[G(a:(fc))] 


< (1 -k 5/2)G(x*) -k ai{K 2 -k K 3 ) - /3 log 7 m™ -k ebn 





where we have used the fact that the contribution of queue sizes
and costs in $(0, n^*]$ to the average quantities vanishes
as $N \to \infty$. The above inequalities can be independently
optimized over $\delta \in [0, \delta^*]$ (the performance of the algorithm
is independent of $\delta$). Here we choose $\delta = \delta^*$ in the queue
inequality and $\delta = 0$ in the cost inequality. The statement
of Theorem 2 then follows using the Ergodic theorem and the
fact that the jump chain and the original chain have the
same steady-state average behaviour.

7. CONCLUSIONS 

Motivated by modern stream data processing applications,
we have investigated the problem of dynamically scheduling
graphs in cloud clusters, where a graph represents a specific
computing job. These graphs arrive and depart over time.
Upon arrival, each graph can either be queued or served
immediately. The objective is to develop algorithms that
assign nodes of these graphs to free (computing) slots in the
machines of the cloud cluster. The goal of the scheduler
(which partitions graphs and maps them to slots) is to
minimize the average graph partitioning cost while keeping
the system stable.

We have proposed a novel class of low complexity algorithms
which can approach the optimal solution by exploiting
the trade-off between delay and partitioning cost, without
causing service interruptions. The key ingredient of the
algorithms is the generation/removal of random templates
from the cluster at appropriate instances of time, where each
template is a unique way of partitioning a graph.
APPENDIX

A. PROOF OF THEOREM 1 

The proof is standard and based on a Lyapunov optimization
method. Consider a Lyapunov function
$V(t) = \sum_{j\in\mathcal{J}}\frac{1}{\mu_j}F\big(Q^{(j)}(t)\big)$,
where $F(x) = \int^{x} f(\tau)\,d\tau$. Recall that
$f : \mathbb{R}_+ \to \mathbb{R}_+$ is a concave increasing function; thus F is
convex. Choose an arbitrarily small u > 0. It follows from
convexity of F that for any $t \ge 0$,

V(t + n)- V(t) < (t + u)- gO) (t)) 

+ E -+ “)) - {t + u)- gO) it)). 

j&j 

By definition 


jobs in the system. Hence it follows that 


E —^si^T)\D^^\t,t + u)f{Q^^\t))\ > 
jej 

E®s(kT) [|C'^'’^(t)|/(g^'’^(!))]« - KiU-o{u) 


jej 

for it's = M f{M). Note that the algorithm keeps the config¬ 
uration fixed over intervals [kT, {k + l)T), i.e., C{t) = C{kT) 

forte [fcT, (fc + l)T). Let At,„ := iEs(kT) + , 


then. 


At,„ < Es(kT)[E/(Q''’W)(p^-|C^''^(fc2-)|)] 

+K2 + K3 + 0(1) 

jsj 

+kIt + K2 + a'3 + 0(1) 


Taking the limit m —^ 0, 

dEs(kT) ^ 

- ^ i < ^/(g'^)(fer)(p, -|C7(^)(fcT)|) +A 4 

where A '4 = K\T -|- 7^2 -I- A' 3 . Let A*, = Es(kT) + 

l)r) — U(A:r)j, then over the fc-th cycle 

^ Afc -I- Es(kT) [E E - Termi - Term 2 -I- aA' 4 , 

16-7' A6C0)(feT) 


qO) (t -p tt) _ gO) (t) = jfO) (t, t + u)- D^^'> [t, t + u), 
where for any 0 < ti < t 2 , 

H^^\tl,t2)^Mf{Xj{t2-tl)) 

D^^\ti,t 2 ) = Mf { [ Ga(r)ptjdr), 

Jti 

where Nf{z) and N^{z) denote independent Poisson ran¬ 
dom variables with rate z, for all j G fj. Recall that C{t) = 

(^Ci^\t),Ci^\t)'^ where Ci^'^ is the set of virtual templates 
(i.e., the templates that do not contain jobs of type j) and 
Ca ^ is the set of actual templates. It is easy to see that 

\fiQ^^\t + n)) - /(g'^>(t))| < /'( 0 )|g(^')(t + u) - g(^)(t)|, 

by the mean value theorem, and the fact that / is a con¬ 
cave increasing function. For notational compactness, let 
Es(t)[-] = E[-|S(t)[, where S(t) is the state of the system at 
each time t. Clearly the the maximum number of templates 
that can fit in a configuration is less than M. It is easy to 
see that 

Es{kT) [U(t + u)- l/(t)] < 

E — ®s(kT) \f{Q^^\t)){A‘'^Ht,t + u) - D^^\t,t + u)l 

+ K2U -P o(u), 

where K 2 = f'{0) ^^(Pj + ^)- Clearly virtual templates 
jej 

do not exist in the configuration if there are more than M 


where 

Termi = a'^ pjf{Q^^\kT)), 
jej 

Term2 = E E (^r)) - . 

76-7 A£C(+(kT) 

Let X* be the optimal solution to the static partitioning 
problem. The rest of the proof is similar to the Lyapunov 
analysis in the proof of Theorem (step 3), i.e., for any 
0 < 5 < ( 5 *, 

Termi < a E E x\fiQ^^\kT)) 

jej AeA(+ 

Term 2 < ^ ^ (^{1 + 5)x* a^ (af{Q^^\kT) - 

jSJ A6A0) 

Putting everything together, 

— Afe-pEs(kT) [E E ^a^] — (1 + ^)*^(**) 

feJ Aec<^3) 

-aSp^i„ ^ f{Q^^\kT)) + aKi. 
j&J 

Then it follows from the Foster-Lyapunov theorem that the
Markov chain S(kT), k = 0, 1, 2, ... (and therefore the Markov
chain S(t), $t \ge 0$) is positive recurrent. As in Step 3 of
the proof of Theorem 2, we take the expectation of
both sides of the above inequality with respect to S(kT),
and then sum over k = 0, ..., N − 1, divide by N, and let
$N \to \infty$. Then the statement of the theorem follows by
choosing Bi — K 2 -I- K 3 and B 2 = A| . 



B. AN ALTERNATIVE DESCRIPTION OF 
DYNAMIC GRAPH PARTITIONING AL¬ 
GORITHM 

The Dynamic Graph Partitioning (DGP) algorithm, as
described in Section 5, does not require any dedicated clock
as the decisions are made at the instances of job arrival and
departure. In this section, we present an alternative
description of the algorithm by using dedicated clocks. Each
queue is assigned an independent Poisson clock of rate
{h-l-q{t))//3^ where A is a fixed constant depending on 
how fast the iterations in the algorithm can be performed. 
Equivalently, at each time t, the time duration until the tick 
of the next clock is an exponential random variable with 
parameter This means if changes at 

time t' > t before the clock makes a tick, the time duration 
until the next tick is reset to an independent exponential 
random variable with parameter ))/P^ q’jjg jjg_ 

scription of the algorithm is given below. 


Algorithm 4 Alternative Dynamic Graph Partitioning 

(ADGP) Algorithm 

At the instances of dedicated clocks. 

Suppose the dedicated clock of queue makes a tick, 

then: 

1: A virtual template $A^{(j)}$ is chosen randomly from the
currently feasible templates for graph $G_j$, given the current
configuration, using the Random Partition Procedure, if
possible. Then this template is added to the configuration

with probability e P ^ and discarded otherwise. The 
virtual template leaves the system after an exponentially
distributed time duration with mean $1/\mu_j$.

2: If there is a job of type j in $Q^{(j)}$ waiting to get service,
and a virtual template of type j is created in step 1,
this virtual template is filled by a job from $Q^{(j)}$, which
converts the virtual template to an actual template.

At arrival instances.

1: Suppose a graph (job) of type $G_j$ arrives. The job is
added to queue $Q^{(j)}$.

At departure instances.

1: At the departure instances of actual/virtual templates,
the algorithm removes the corresponding template from
the configuration.

2: If this is a departure of an actual template, the job
departs and the corresponding queue is updated.


for the following distribution 7 

( 33 ) 

where is the normalizing constant. 


Similarly to the DGP(IT) algorithm, as /3 —>■ 00 , the opti¬ 
mizing TV converges to 7 . The distribution 7 has the inter¬ 
pretation of the steady state distribution of configurations 
in a loss system with arrival rates Aj = \, j £ J, and ser¬ 
vice rates fij = fj,j, j £ J■ In the loss system, when a graph 
arrives, it is randomly distributed over the machines if pos¬ 
sible; otherwise it is dropped. At the departure instances, 
the job and hence its template leave the system. 

Proof of Proposition 4. The proof is basically identical
to the proof of Proposition 1. The only difference is that
the detailed balance equations are given by


• 7 r(C' © 


Vc/«3<T)/(3 

\A(A{C)\ ® 


O) 

A 


for any configuration C and C © £ C\ 

j £ J■ Here the LHS is the departure rate of (virtual or 
actual) template from the configuration C © A^^\ The 
RHS is the rate at which the (actual or virtual) template A*--’^ 
for graphs Qj is added to configuration C (the Random par¬ 
tition Procedure selects a template A £ uniformly 

at random). Thus the detailed balanced equations are given 
by 


■k{C®A^^^) = 7r(C) 




|A(A(g)| 


ef< ^ . 


(34) 


and it is easy to see that (15 I with 7 replaced with 7 in 


(331, indeed satisfies the detailed balance equations, with 
the normalizing condition that t(C') = 1. The fact that 
that this distribution maximizes the stated objective func¬ 
tion follows in parallel with the arguments in the proof of 
Proposition □ 


The algorithm will yield average queue size and partitioning
cost performance similar to those in Theorem 2. The
proof essentially follows the three steps of the proof of
Theorem 2. Here, we only describe the main property of the
Alternative Dynamic Graph Partitioning algorithm with fixed
weights, which we refer to as ADGP(W) (the counterpart of
DGP(W) in Section 6).

Proposition 4. Under ADGP(W), the steady state dis¬ 
tribution of configurations solves 

maxE,r [^y~^ - , 0 Dkt(t II 7 ) 












